2026-01-15
All models are wrong, but how they’re wrong matters.
Up to this point, we’ve mostly focused on more banal kinds of wrongness:
Issues like heteroskedasticity, overdispersion, and non-normality can distort a regression's p-values and confidence intervals.
Using a linear model for a non-linear relationship can bias regression coefficients and give us inaccurate or nonsensical predictions.
Autocorrelation can make it look like we have a lot of data when we actually have a very small number of independent observations.
Violating these assumptions matters, but we have relatively easy fixes.
All models are wrong, but the importance of that wrongness depends on our goals:
If our primary goals are prediction or description, we might care very little about spurious correlation
If our primary goal is prescription or explanation, then we probably care a lot about spuriousness:
Up to now, we’ve mostly been focusing on those more banal problems, but for the latter part of the course we’ll be talking about the more difficult problem of identifying causal relationships like:
do get out the vote campaigns cause people to vote?
does a person’s race cause police to treat them differently?
do housing first policies cause a reduction in homelessness?
Causal claims rely on counterfactuals: “if X, then Y” implies “If not X, then not Y”
But we never actually observe counterfactuals!
Causal claim: Humphrey Bogart got cancer from cigarettes
The more generic version of this claim isn’t necessarily any easier to prove:
This should be a familiar problem! We talk about endogeneity or spurious correlation all the time
In observational research, we can try to address this with control variables, but those may not be adequate for really complex confounding.
Causal claims rest on something we never observe.
A more pessimistic view is that this is basically unsolvable:
A somewhat more optimistic view is that we can solve this for aggregate probabilistic claims if some very strict assumptions are met.
Years lived if you smoke: \(Y_i(1)\)
Years lived if you quit: \(Y_i(0)\)
Effect of smoking vs. quitting:
\[Y_i(1) - Y_i(0)\]
| Subject | Smoker | Quitter | Difference |
|---|---|---|---|
| A | 60 | 71 | -11 |
| B | 72 | 70 | 2 |
| C | 72 | 84 | -12 |
| D | 71 | 60 | 11 |
| E | 72 | 75 | -3 |
| F | 52 | 64 | -12 |
| G | 70 | 80 | -10 |
| Average | 67 | 72 | -5 |
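If we could observe both potential outcomes, computing the effect would be trivial. A minimal sketch, using the hypothetical values from the table above:

```python
import numpy as np

# Hypothetical potential outcomes from the table above:
# Y(1) = years lived if the subject smokes, Y(0) = years lived if they quit.
y1 = np.array([60, 72, 72, 71, 72, 52, 70])  # subjects A-G, smoker
y0 = np.array([71, 70, 84, 60, 75, 64, 80])  # subjects A-G, quitter

effects = y1 - y0      # individual effects Y_i(1) - Y_i(0)
ate = effects.mean()   # average treatment effect

print(ate)  # -5.0
```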
| Subject | Smoker | Quitter | Difference |
|---|---|---|---|
| A | ?? | 71 | ?? |
| B | 72 | ?? | ?? |
| C | 72 | ?? | ?? |
| D | 71 | ?? | ?? |
| E | 72 | ?? | ?? |
| F | 52 | ?? | ?? |
| G | ?? | 80 | ?? |
| Average | 67.8 | 75.5 | -7.7 |
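Masking the counterfactuals and naively comparing the observed group means gives a different answer than the true ATE of \(-5\). A sketch with the same hypothetical values, using `nan` to mark the unobservable cells:

```python
import numpy as np

# Observed outcomes only; nan marks the unobservable counterfactual.
smoker  = np.array([np.nan, 72, 72, 71, 72, 52, np.nan])              # B-F observed
quitter = np.array([71, np.nan, np.nan, np.nan, np.nan, np.nan, 80])  # A, G observed

naive_diff = np.nanmean(smoker) - np.nanmean(quitter)
print(round(naive_diff, 1))  # -7.7
```

The gap between \(-7.7\) and the true \(-5\) is exactly the problem: who smokes is correlated with the potential outcomes themselves.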
Imagine we have some treatment \(D\) that reliably induces people to stop smoking.
Then we need to estimate the expected outcome for subject \(i\) conditional on being assigned the treatment \(D\):
\[E[Y_i(1)|D_i = 1]\]
And also their expected value conditional on **not being assigned to the treatment group**. And we can't observe both of these simultaneously.
\[E[Y_i(0)|D_i = 1]\]
We want the expected years of life for individual \(i\) if they smoke compared to their expected years of life if they quit:
\[E[Y_i(1) - Y_i(0)]\]
But we only observe the expected years of life for each group separately, conditional on treatment status:
\[E[Y_i(1)|D_i=1], E[Y_i(0)|D_i=0]\]
If we can assume the treatment assignment \(D_i\) is random and thus independent (\(\unicode{x2AEB}\)) of any other predictors of life expectancy \(X_i\)
\[Y_i(0), Y_i(1), X_i \mathrel{\unicode{x2AEB}} D_i\]
…then the conditional expectation is the same as the unconditional expectation and the effect is just the difference of means between each group (plus some random error)
\[E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0] = E[Y_i(1) - Y_i(0)]\]
This is a roundabout way of saying that the problem of causal inference is solvable in the aggregate if and only if the “cause” is uncorrelated with any other characteristic that influences the outcome. If those hold, then a simple regression model or difference of means test can identify a causal relationship!
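As a sanity check, here is a small simulation (all numbers hypothetical) in which assignment is randomized, so the simple difference in group means recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical potential outcomes: baseline health x shifts life expectancy,
# and the true effect of smoking vs. quitting is -5 years for everyone.
x = rng.normal(0, 5, n)
y0 = 72 + x + rng.normal(0, 3, n)   # Y_i(0): years lived if quitting
y1 = y0 - 5                         # Y_i(1): years lived if smoking

d = rng.integers(0, 2, n)           # random assignment: independent of y0, y1, x
y_obs = np.where(d == 1, y1, y0)    # we only ever see one potential outcome

diff_in_means = y_obs[d == 1].mean() - y_obs[d == 0].mean()
print(round(diff_in_means, 2))      # close to the true ATE of -5
```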
```mermaid
graph LR
  A[Lifestyle]
  B[Smoking]
  C[Cancer]
  A --> B
  A --> C
  B --> C
```
Notably, we don’t need to account for everything that impacts life expectancy here. We just need to ensure that those other predictors are not correlated with the treatment we’re interested in.
```mermaid
graph LR
  A[Lifestyle]
  B[Smoking]
  C[Cancer]
  A --> C
  B --> C
```
The no-confounding assumption: we can't observe the counterfactual for any individual, but we can infer an average counterfactual for groups provided we can assume the treatment is independent of potential outcomes and any confounders: \(Y_i(0), Y_i(1), X_i \mathrel{\unicode{x2AEB}} D_i\)
The excludability assumption requires that treatment assignment affects the outcome only through the treatment itself (so we need to rule out things like placebo effects)
The non-interference assumption (aka the Stable Unit Treatment Value Assumption or SUTVA) assumes that treatment assignment for one unit doesn’t impact the others (for instance, if people in the treatment group influence people in the control group)
Average Treatment Effect (ATE): the average difference between potential outcomes, \(E[Y_i(1) - Y_i(0)]\)
Average Treatment Effect on the Treated (ATT): \(E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=1]\)
Average Treatment Effect on the Untreated (ATU): \(E[Y_i(1)|D_i=0] - E[Y_i(0)|D_i=0]\)
In the idealized scenario, these should be equivalent, but in practice they will likely diverge due to things like heterogeneous treatment effects, non-compliance, and imbalance between treatment and control units.
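To see the divergence, here is a hypothetical simulation with heterogeneous effects, where units that benefit more are more likely to take the treatment (the effect sizes and selection rule are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unit-level treatment effects centered on 5, but units with larger
# effects self-select into treatment, so ATT > ATE > ATU.
tau = rng.normal(5, 2, n)
p_treat = 1 / (1 + np.exp(-(tau - 5)))  # selection probability rises with the effect
d = rng.random(n) < p_treat

ate = tau.mean()        # effect averaged over everyone
att = tau[d].mean()     # effect among the treated
atu = tau[~d].mean()    # effect among the untreated
print(round(ate, 2), round(att, 2), round(atu, 2))
```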
Experiments are the most straightforward way to satisfy the “no confounding” assumption, despite their limitations.
Sides and Citrin (2007): people who overestimate immigration numbers tend to have more negative attitudes towards immigration. But is this a causal relationship?
We could have multiple kinds of confounding here, including the possibility that anti-immigration attitudes impact misperceptions of immigrant numbers (simultaneous causation)
```mermaid
graph LR
  A[low information etc.]
  B[overestimating immigrant numbers]
  C[anti-immigrant attitudes]
  A ==> B
  A <==> C
  B --> C
```
With confounding, the difference between “overestimators” and people with accurate perceptions:
\[\underbrace{E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{the difference between overestimators and non-overestimators}\]
is actually a combination of the actual effect and the effect of confounding:
\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{actual effect of overestimating} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{the effect of other stuff correlated with overestimation}\] We may not even know what all of the “other stuff” is, so including more controls may not solve this.
Hopkins, Sides and Citrin (2018): The Muted Consequences of Correct Information about Immigration
Question: People consistently overestimate immigrant populations. Does giving them correct information about immigration levels influence their attitudes?
Method: a survey experiment. Randomly assign some survey respondents to receive correct information about immigration levels before asking them their views.
Observational research still has this same basic problem, even if we don't talk about “treatment” in the same way:
Hypothesis: Fox News viewers are less likely to get the Covid vaccine.
Issue: the \(E[Y_i(1)]\) for Fox News viewers is not the same as \(E[Y_i(1)]\) for non-Fox viewers. The “treatment” and “control” groups differ for all sorts of reasons. So \[\underbrace{E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{difference between viewers and non-viewers}\] is now a combination of:
\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{treatment effect for Fox News viewers} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{effect of predisposition to watch Fox News}\]
Or the role of racial bias in motivating traffic stops
\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{racial discrimination effect} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{non-racial profiling and actual rate of minor traffic violations}\]
Or fears of being drafted on war attitudes
\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{fear of being drafted} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{pre-existing attitudes about war}\]
Typically, we try to account for this sort of thing using control variables. But cramming stuff in can also introduce problems:
The inclusion of bad controls (colliders) can actually make bias worse rather than better.
To get a better sense of this, we need to take a brief detour into DAGs
Directed Acyclic Graphs: describe the relevant causal relationships between an independent variable (IV) and a dependent variable (DV) of interest.
Nodes are variables: Z, X, and Y
Arrows (aka edges) indicate causal relationships so \(X\rightarrow Y\), \(Z \rightarrow Y\) and \(Z \rightarrow X\)
A path is any sequence of edges connecting two nodes, regardless of the direction of the arrows. So \(X \rightarrow Y\) and \(X \leftarrow Z \rightarrow Y\) are both paths from X to Y
Our goal is to isolate the effect of X on Y by “closing all open backdoor paths”, usually by conditioning on the offending variables. So here, we want to control for Z to close \(X \leftarrow Z \rightarrow Y\)
Including a control variable in a regression is one of several ways to condition, but it requires us to measure Z and include it in the model.
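A hypothetical sketch of this: simulate \(X \leftarrow Z \rightarrow Y\) with a true effect of \(X\) on \(Y\) equal to 1 (all coefficients invented), then compare a regression that omits \(Z\) to one that conditions on it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

z = rng.normal(0, 1, n)              # confounder: Z -> X and Z -> Y
x = 2 * z + rng.normal(0, 1, n)      # treatment, partly driven by Z
y = x + 3 * z + rng.normal(0, 1, n)  # true effect of X on Y is 1

def x_coef(design, outcome):
    # OLS via least squares; X's coefficient is first in each design matrix.
    # Intercept omitted since all variables are mean-zero by construction.
    return np.linalg.lstsq(design, outcome, rcond=None)[0][0]

b_naive = x_coef(np.column_stack([x]), y)     # backdoor X <- Z -> Y left open
b_adj   = x_coef(np.column_stack([x, z]), y)  # conditioning on Z closes it

print(round(b_naive, 2))  # biased upward (about 2.2 here)
print(round(b_adj, 2))    # near the true effect of 1
```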
Unobserved confounding (sometimes represented using dashed lines) can be addressed by randomization, but not by regression
Collision happens when two arrows along a path between X and Y point into the same node.
So the collider here is \(X \rightarrow Z \leftarrow Y\)
Collider paths are “closed” on their own. So, unlike the confounding case, conditioning on the collider opens the path and introduces bias rather than reducing it.
Collider bias is really a more general version of the selection-bias problem. Recall that selection bias occurs when we only see outcomes above a certain threshold, for instance when low expected earnings cause people to drop out of the sample.
Since these outcomes are unobserved, any model fit to the remaining sample is implicitly stratified on the collider
The selection problem frames this as a function of a failure to consider a group, but it could also be thought of as a form of conditioning on potential outcomes: the probability of seeing certain observations depends on \(E[Y_i(1)]\).
|  | true model | without collider | with collider |
|---|---|---|---|
| (Intercept) | 9.267 (6.063) | 68.450*** (5.962) | -5.578 (4.686) |
| educ_year | 9.453*** (0.390) | 6.594*** (0.367) | 6.028*** (0.326) |
| lfp: in labor force |  |  | 82.980*** (3.141) |
| N | 1000 | 800 | 1000 |
| R2 | 0.370 | 0.288 | 0.630 |
| logLik | -5279.006 | -4043.012 | -5013.666 |
| AIC | 10564.013 | 8092.024 | 10035.333 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |  |
Selection problems can cause biased estimates. Including a control for a collider has a similar impact.
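A hypothetical simulation in the spirit of the table above (the variable names `educ`, `wage`, `lfp` and all coefficients are illustrative): labor-force participation is caused by both education and the wage itself, so both selecting on it and controlling for it attenuate the education coefficient:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

educ = rng.normal(12, 3, n)
wage = 10 + 9 * educ + rng.normal(0, 30, n)  # true effect of education is 9

# Collider: participation depends on both education and the wage itself
lfp = (0.5 * educ + 0.1 * wage + rng.normal(0, 3, n)) > 13

def educ_slope(design, outcome):
    # OLS via least squares; education's coefficient is second (after the intercept)
    return np.linalg.lstsq(design, outcome, rcond=None)[0][1]

ones = np.ones(n)
b_true    = educ_slope(np.column_stack([ones, educ]), wage)                 # full sample
b_select  = educ_slope(np.column_stack([ones[lfp], educ[lfp]]), wage[lfp])  # drop non-participants
b_collide = educ_slope(np.column_stack([ones, educ, lfp]), wage)            # control for the collider

print(round(b_true, 2), round(b_select, 2), round(b_collide, 2))
```

Both the selected-sample and collider-controlled estimates fall below the true coefficient, mirroring the pattern in the table.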
This might seem like an odd thing to do, but it actually comes up a lot!
For instance, in discussions of discrimination: people will advocate for examining wage disparities after controlling for things (like job title) that are themselves downstream from discrimination and wages.
All paths from discrimination to earnings:
Discrimination \(\rightarrow\) Earnings (direct effect)
Discrimination \(\rightarrow\) Job title \(\rightarrow\) Earnings (mediated effect)
Discrimination \(\rightarrow\) Job title \(\leftarrow\) Ability \(\rightarrow\) Earnings (collider)
What happens if we disaggregate by position and then compare wages? Why?
One way to think about this is in terms of potential outcomes: the women who get promoted would, absent discrimination, already have a higher expected wage than the men. So when you stratify on job title, you end up comparing women who would have high wages if not for discrimination to men who would have lower wages if not for discrimination:
For colliders (\(\rightarrow Z \leftarrow\)), the best approach is actually to do nothing: they're already “closed.”
On the other hand, we do need to control for confounders (\(\leftarrow Z \rightarrow\)), but that assumes we can measure them.
Randomization allows for true causal inference, but it's often not an option.
Regression can do this, but only under very strict conditions and there’s a real risk of garbage can models making things worse.
Quasi-experimental methods look for ways to ensure that treatment is independent of potential outcomes, or, barring that, that potential outcomes are balanced between treated and control units.